What influences the number of reviews and the rating of restaurants on TripAdvisor in Geneva

Authors

Nevenka Bustamante

Besart Demiri

Camila Heredia

Maxime Tourneau

Maissa Zetchi

Published

December 15, 2023

1. Introduction

Geneva, Switzerland.

1.1 Motivation

In today’s data-driven world, businesses and organizations face the ever-growing challenge of understanding the factors that influence customer behavior and satisfaction. Customer reviews play an important role in shaping the success of products and services, as they provide direct insight into consumer preferences and feedback. This project addresses the need for predictive modeling of the impact of measurable variables on the number of reviews. By quantifying the relationship between factors such as price, restaurant features, opening hours, and distances to points of interest, we can unlock valuable insights for data-informed decisions. Ultimately, this analysis is motivated by the desire to help restaurants enhance customer engagement, optimize their offerings, and drive growth by harnessing the power of data analytics.

Switzerland is known for its high level of service excellence: whether in hotels, restaurants, or other service-oriented businesses, there is a strong emphasis on providing attentive, courteous, and efficient service to guests. With that in mind, we turned to Geneva.

Its hospitality industry is geared towards serving a diverse and often international clientele, leading to a cosmopolitan and inclusive atmosphere. This allows us to explore whether certain types of restaurants are more popular among the population, and whether there is substantial variation in customer preferences.

The Canton of Geneva is composed of different municipalities; we selected only those directly dependent on the city of Geneva. As a first step, we work with them individually.

1.2 Research Questions

Which variables influence the number of reviews and the rating left on TripAdvisor in Geneva’s restaurant industry?

1.3 Exploratory Questions

  • Which type of cuisine is the most popular?
  • What are the most recurrent features in restaurant descriptions?
  • On average, how many reviews and ratings does a restaurant have?
  • What are the average opening hours?

1.4 Data Presentation

Source: https://www.tripadvisor.com/Restaurants-g188057-Geneva.html. Our first and largest database comes from TripAdvisor. It provides information on each restaurant's address, its coordinates and postal code, the number of reviews, the rating, the type of cuisine, etc.

1st Data Frame: Most relevant variables

Variable Description
Address Restaurant's address
Latitude Latitude of the restaurant
Longitude Longitude of the restaurant
Postal Code Restaurant's postal code (Canton of Geneva)
Number_of_reviews Number of reviews
rating Average rating of the restaurant
photoCount Number of photos given in the reviews
PriceRange Menu price range
Cuisines Different type of cuisine proposed
OpenHours1 Opening morning hour
CloseHours1 Closing morning hour
OpenHours2 Opening afternoon hour
CloseHours2 Closing afternoon hour
description Restaurant's description
Features Features offered
mealTypes Different meal type
TrainStation Latitude TrainStation latitude
TrainStation Longitude TrainStation longitude
rankingPosition Tripadvisor ranking
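Later sections use a derived variable `averaged_price` that is not listed above; below is a minimal sketch of how it can be computed once the price range has been parsed into `minPrice` and `maxPrice` (the column names and the midpoint rule are assumptions based on the later code):

```r
# Hypothetical example rows; in the project these columns live in df3
df3_demo <- data.frame(minPrice = c(20, 35, NA),
                       maxPrice = c(60, 95, NA))

# Averaged menu price: midpoint of the advertised price range
df3_demo$averaged_price <- (df3_demo$minPrice + df3_demo$maxPrice) / 2

df3_demo$averaged_price  # 40, 65, NA
```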

2nd Data Frame: Parking’s coordinates

Variable Description
Name Parking Name
Address Parking address
Latitude Parking latitude
Longitude Parking longitude
Postalcode Postalcode

3rd Data Frame: Public transport stop coordinates

Variable Description
Name Public stop Name
Address Public stop address
Latitude Public stop latitude
Longitude Public stop longitude
Postalcode Postalcode
Code
df3$OpenedHours <- df3$OpenedHours1 + df3$OpenedHours2
df3$OpenedHours1 <- NULL
df3$OpenedHours2 <- NULL
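The `OpenedHours1`/`OpenedHours2` columns summed above hold daily opening durations; here is a hedged sketch of how such a duration could be derived from the `OpenHours1`/`CloseHours1` time columns (the "HH:MM" string format is an assumption):

```r
# Assumed "HH:MM" strings; the real columns may use a different format
open1  <- c("08:00", "11:30")
close1 <- c("14:00", "15:00")

# Convert "HH:MM" to decimal hours
to_hours <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  sapply(parts, function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60)
}

OpenedHours1 <- to_hours(close1) - to_hours(open1)
OpenedHours1  # 6.0 3.5
```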

2. Geneva Restaurants

2.1 Geneva Map

Code
shapefile_data <- st_read(here::here("Data/Canton_Genève.shp"), quiet = TRUE)

# Extract the geometry information
geometry <- st_geometry(shapefile_data)

# Create a data frame without the geometry column
attributes_data <- st_drop_geometry(shapefile_data)

# Combine the geometry and attributes into a simple features data frame
sf_data <- st_sf(attributes_data, geometry = geometry)
shapefile_data <- st_transform(sf_data, crs = st_crs("+proj=longlat +datum=WGS84"))

Ge <- shapefile_data %>% filter(COMMUNE == 'Genève' | COMMUNE == 'Carouge (GE)')

##all together 

map1 <- leaflet(shapefile_data) %>%
  addTiles() %>%
  addPolygons(fillColor = "blue", fillOpacity = 0.5, color = "white", weight = 1, label = ~COMMUNE) %>%
  addPolygons(data = Ge, fillColor = "red", fillOpacity = 0.7, color = "white", weight = 2, label = ~COMMUNE)

map1

We also included the municipality of Carouge because many of the restaurants in our database are located there.
The map below gives an idea of the location of each restaurant in Geneva:

Code
geo_cols <- c("latitude", "longitude", "address")
geo_df <- df3[, geo_cols]
geneva_coords <- c(46.2044, 6.1432)
# Create a leaflet map
map <- leaflet(geo_df) %>%
  addTiles() %>%
  addMarkers(
    clusterOptions = markerClusterOptions(),
    popup = ~as.character(address),
  ) %>%
  setView(lng = geneva_coords[2], lat = geneva_coords[1], zoom = 13)

map

2.2 Number of restaurants by Postalcode

Code
total_restaurants <- df3 %>%
  dplyr::group_by(Postalcode) %>%
  dplyr::summarize(TotalRestaurants = n())

barplot2 <- total_restaurants %>%
  plot_ly(x = ~Postalcode,
          y = ~TotalRestaurants,
          color = ~Postalcode,  # Use Postalcode as color variable
          colors = brewer.pal(9, "Set3"),  # Use Set3 palette with 9 colors
          type = "bar",
          name = ~Postalcode) %>%
  layout(title = "Number of restaurants by Postalcode") %>%
  layout(xaxis = list(title = "Postalcode", showgrid = FALSE))

barplot2

2.3 Frequency of cuisine type & meal type

Code
cuisinetext <- df3$Cuisines %>%
  tolower() %>%
  str_replace_all("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "") %>%
  str_replace_all("[[:punct:]]+", "") %>%
  str_replace_all("[[:digit:]]+", "") %>%
  str_trim() %>%
  str_replace_all("\\s+", " ") %>%
  tm::removeWords(stopwords("en")) 


cuisinetext1 <- cuisinetext %>% 
  tm::VectorSource() %>% 
  tm::Corpus() %>% 
  tm::TermDocumentMatrix()
tm::inspect(cuisinetext1)

tag_cuisine <- cuisinetext %>% 
  tokens() %>% 
  quanteda::dfm(., verbose = FALSE)
tidy_df <- tidytext::tidy(tag_cuisine)


tf_idf <- tidy_df %>%
  tidytext::bind_tf_idf(term, document, count) %>%
  arrange(desc(tf_idf))

tidy_words <- df3 %>%
  tidytext::unnest_tokens(word, Cuisines) %>% 
  mutate(word = SnowballC::wordStem(word)) %>%
  dplyr::select(word) %>%
  plyr::count() %>%
  arrange(desc(freq)) 

reject_words <- c("option", "brew", "barbecu", "grill", "soup", "friendli", "intern")

# Remove common words from the tidy_words data frame
filtered_words <- anti_join(tidy_words, data.frame(word = reject_words), by = "word")


wordcloud(words = filtered_words$word, freq = filtered_words$freq, min.freq=5, scale=c(3,0.5), colors=brewer.pal(8, "Dark2"))

Code
#wordcloud with the column Features

featurestext <- df3$Features %>%
  tolower() %>%
  str_replace_all("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "") %>%
  str_replace_all("[[:punct:]]+", "") %>%
  str_replace_all("[[:digit:]]+", "") %>%
  str_trim() %>%
  str_replace_all("\\s+", " ") %>%
  tm::removeWords(stopwords("en")) 


featurestext1 <- featurestext %>% 
  tm::VectorSource() %>% 
  tm::Corpus() %>% 
  tm::TermDocumentMatrix()
tm::inspect(featurestext1)

tag_features <- featurestext %>% 
  tokens() %>% 
  quanteda::dfm(., verbose = FALSE)
tidy_df1 <- tidytext::tidy(tag_features)


tf_idf2 <- tidy_df1 %>%
  tidytext::bind_tf_idf(term, document, count) %>%
  arrange(desc(tf_idf))

tidy_words <- df3 %>%
  tidytext::unnest_tokens(word, Features) %>% 
  mutate(word = SnowballC::wordStem(word)) %>%
  dplyr::select(word) %>%
  plyr::count() %>%
  arrange(desc(freq)) 


wordcloud(words = tidy_words$word, freq = tidy_words$freq, min.freq=5, scale=c(3,0.5), colors=brewer.pal(8, "Dark2"))

2.4 Distribution of the number of reviews and rating

Code
ggplot(df3, aes(x = Number_of_reviews)) +
  geom_histogram(binwidth = 50, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Number of Reviews",
       x = "Number of Reviews",
       y = "Frequency")

Code
ggplot(df3, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Ratings",
       x = "Rating",
       y = "Frequency")


2.5 Distribution of cuisine types

Code
cuisine_cols <- c("French", "Italian", "European", "Vegetarian", "Vegan",
                   "Mediterranean", "Asian", "Gluten_free", "Spanish", "Swiss")
cuisine_df <- df3[, cuisine_cols]

# Melt the dataframe from wide to long (requires the reshape2 package)
melted_df <- reshape2::melt(cuisine_df)

# Filter for rows where value is 1
filtered_df <- melted_df[melted_df$value == 1, ]

# Check if there are rows to plot
if (nrow(filtered_df) > 0) {
  # Create a bar plot for cuisine distribution
  ggplot(filtered_df, aes(x = variable, fill = factor(value))) +
    geom_bar(stat = "count", position = "dodge") +
    labs(title = "Main type of Cuisines Distribution",
         x = "Type of Cuisines",
         y = "Count") +
    scale_fill_manual(values = c("1" = "salmon"), guide = "none") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
} else {
  print("No data to plot.")
}

2.6 Correlation Matrix

Code
numeric_cols <- c("Number_of_reviews", "rating", "minPrice", "maxPrice", "rankingPosition",
                  "Distance_neareststop", "Distance_nearestparking","Distance_to_trainstation", 
                  "averaged_score_competition"
                  )

numeric_df3 <- df3[, numeric_cols]
cor_matrix <- cor(numeric_df3)

#ggcorrplot(cor_matrix, 
           #hc.order = TRUE,
           #type = "upper", # Type of plot: "full", "lower", or "upper"
           #outline.color = "white",
           #colors = c("#007000", "#FFBF00", "#AC0C0C"), 
           #lab_size = 2, 
           #lab = TRUE,
           #ggtheme = theme_minimal())




my_colors <- colorRampPalette(c("#007000", "#FFBF00", "#AC0C0C"))(100)
corrplot(cor_matrix, method = "color", col = my_colors)

Code
cuisine_cols <- c("Number_of_reviews", "rating", "French", "Italian", "European", "Vegetarian", "Vegan", "Mediterranean", "Asian", "Gluten_free", "Spanish", "Swiss" )

cuisine_df3 <- df3[, cuisine_cols]
cor_matrix1 <- cor(cuisine_df3)

corrplot(cor_matrix1, method = "color", col = my_colors)

Code
mealtype_cols <- c("Number_of_reviews", "rating", "Lunch", "Drinks", "Brunch", "Breakfast", "Dinner", "Late_Night_Drinks")

mealtype_df3 <- df3[, mealtype_cols]
cor_matrix3 <- cor(mealtype_df3)

corrplot(cor_matrix3, method = "color", col = my_colors)

2.7 Additional graphs

Code
selected_cols <- c("rating", "Number_of_reviews", "mealTypes", "Cuisines")
selected_df <- df3[, selected_cols]

# Split the "Mealtype" column into separate rows
selected_df <- selected_df %>%
  separate_rows(mealTypes, sep = " ")

# Create box plots for rating by meal type
ggplot(selected_df, aes(x = mealTypes, y = rating, fill = mealTypes)) +
  geom_boxplot() +
  labs(title = "Box Plot of Rating by Meal Type",
       x = "Meal Type",
       y = "Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The boxplot illustrates that ratings for various meal types—Breakfast, Brunch, Dinner, Drinks, Late, Lunch, and Night—are consistently high with median values around 4.5, indicating general customer satisfaction. The interquartile ranges are narrow, showing low variability in ratings for each meal type, and the presence of outliers for some meals suggests occasional deviations from typical ratings. Overall, there is no significant difference in the central tendency of ratings among the different meal types, implying a uniform quality of experience.

Code
ggplot(selected_df, aes(x = `Number_of_reviews`, y = rating)) +
  geom_point(alpha = 0.6) + 
  labs(title = "Scatter Plot of Number of Reviews vs. Rating",
       x = "Number of Reviews",
       y = "Rating") +
  theme_minimal()


3. Analysis

Code
###Selection of specific columns for the analysis

Bigdata <- df3 %>% dplyr::select(-address, -latitude, -longitude, -Postalcode, -minPrice, -maxPrice, -Cuisines, -OpenHours1, -CloseHours1, -OpenHours2, -CloseHours2, -City, -description, -Features, -mealTypes, -Trainstation_latitude, -Trainstation_longitude, -rankingString)

3.1 Relation between the individual variables

Code
variables_to_show <- c("Distance_to_trainstation", "Distance_nearestparking","Distance_neareststop",
                   "Distance_to_jet","Distance_to_catedral","Distance_to_patekmuseum",
                   "Distance_to_botanicgarden", "Distance_to_nationpalace", "Distance_to_brokenchair",
                   "Number_of_reviews", "rating", "averaged_price")

plot_matrix <-pairs(Bigdata[,variables_to_show], col = "blue", pch = 16)

Code
df3 %>%
  ggplot(aes(log(rankingPosition + 1),log(Number_of_reviews + 1))) +
  geom_point()+
  geom_smooth()+
  xlab("ranking")+
  ylab("Number of reviews")+
  theme_minimal()

The scatter plot shows a non-linear relationship between rankings and the number of reviews, with a peak in review quantity at mid-level rankings and fewer reviews at the extremes. The confidence interval indicates greater prediction uncertainty at the lowest and highest rankings.

Code
df3 %>% filter(df3$rating > 3) %>%
  ggplot(aes(rating,log(Number_of_reviews+1))) +
  geom_point()+
  geom_smooth()+
  xlab("rating")+
  ylab("Number of reviews")+
  theme_minimal()

The scatter plot displays a trend where the number of reviews peaks around a rating of 4.0, diminishes towards a rating of 4.5, and then slightly increases again at a perfect rating of 5.0. The confidence interval widens as the ratings approach the extremes, indicating more variability in the number of reviews for exceptionally high and low ratings.

Code
df3 %>% filter(df3$rating > 3) %>%
  ggplot(aes(rating,log(rankingPosition +1))) +
  geom_point()+
  geom_smooth()+
  xlab("rating")+
  ylab("Ranking Position")+
  theme_minimal()

Code
df3 %>% 
  ggplot(aes(log(photoCount + 1),log(Number_of_reviews + 1))) +
  geom_point()+
  geom_smooth()+
  xlab("Photo Count")+
  ylab("Number of reviews")+
  theme_minimal()

Code
df3 %>% filter(df3$averaged_price < 300) %>%
  ggplot(aes(averaged_price,Number_of_reviews)) +
  geom_point()+
  geom_smooth()+
  xlab("averaged price")+
  ylab("Number of reviews")+
  theme_minimal()

The scatter plot indicates that the number of reviews tends to be higher for items with lower average prices, with the number of reviews decreasing as the average price increases. The fitted line shows a slight negative trend, and the confidence interval becomes wider with increasing price, suggesting less certainty about the number of reviews for higher-priced items.

Code
df3 %>% 
  ggplot(aes(log(Distance_to_trainstation + 1), log(Number_of_reviews + 1))) +
  geom_point()+
  geom_smooth()+
  xlab("Distance to train station")+
  ylab("Number of reviews")+
  theme_minimal()

Code
df3 %>% filter(df3$Distance_nearestparking < 5000) %>%
  ggplot(aes(log(Distance_nearestparking + 1), log(Number_of_reviews + 1), )) +
  geom_point()+
  geom_smooth()+
  xlab("Distance to parking")+
  ylab("Number of reviews")+
  theme_minimal()

Code
df3 %>% 
  ggplot(aes(log(Distance_to_catedral + 1),log(Number_of_reviews + 1))) +
  geom_point()+
  geom_smooth()+
  xlab("Distance to cathedral")+
  ylab("Number of reviews")+
  theme_minimal()

Code
df3 %>% 
  ggplot(aes(log(Distance_neareststop + 1),log(Number_of_reviews + 1 ))) +
  geom_point()+
  geom_smooth()+
  xlab("Distance to nearest stop")+
  ylab("Number of reviews")+
  theme_minimal()

Code
df3 %>% 
  ggplot(aes(log(Distance_to_jet + 1), log(Number_of_reviews + 1))) +
  geom_point()+
  geom_smooth()+
  xlab("Distance to Jet d'Eau")+
  ylab("Number of reviews")+
  theme_minimal()

Code
df3 %>% 
  ggplot(aes(log(Distance_to_patekmuseum + 1),log(Number_of_reviews + 1))) +
  geom_point()+
  geom_smooth()+
  xlab("Distance to Patek museum")+
  ylab("Number of reviews")+
  theme_minimal()

Code
df3 %>% 
  ggplot(aes(log(Distance_to_nationpalace + 1),log(Number_of_reviews + 1))) +
  geom_point()+
  geom_smooth()+
  xlab("Distance to ONU")+
  ylab("Number of reviews")+
  theme_minimal()

Code
df3 %>% 
  ggplot(aes(log(Distance_to_brokenchair + 1),log(Number_of_reviews + 1))) +
  geom_point()+
  geom_smooth()+
  xlab("Distance to Broken Chair")+
  ylab("Number of reviews")+
  theme_minimal()

Code
df3 %>% 
  ggplot(aes(log(Distance_to_botanicgarden + 1),log(Number_of_reviews + 1))) +
  geom_point()+
  geom_smooth()+
  xlab("Distance to Botanic garden")+
  ylab("Number of reviews")+
  theme_minimal()
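The nine scatter plots above differ only in the distance variable on the x-axis; below is a sketch of a helper that builds them in one loop (it assumes the `df3` data frame and ggplot2 used throughout the document, and is guarded so the plotting part only runs when they are available):

```r
# Turn a column name such as "Distance_to_brokenchair" into a readable axis label
pretty_label <- function(v) {
  gsub("_", " ", sub("^Distance_(to_)?", "Distance to ", v))
}

dist_vars <- c("Distance_to_trainstation", "Distance_nearestparking",
               "Distance_neareststop", "Distance_to_jet", "Distance_to_catedral",
               "Distance_to_patekmuseum", "Distance_to_nationpalace",
               "Distance_to_brokenchair", "Distance_to_botanicgarden")

if (requireNamespace("ggplot2", quietly = TRUE) && exists("df3")) {
  library(ggplot2)
  plots <- lapply(dist_vars, function(v) {
    ggplot(df3, aes(x = log(.data[[v]] + 1), y = log(Number_of_reviews + 1))) +
      geom_point() +
      geom_smooth() +
      xlab(pretty_label(v)) +
      ylab("Number of reviews") +
      theme_minimal()
  })
  # print(plots[[1]]), print(plots[[2]]), ... to display them
}
```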

Code
##Creation top 100 restaurants based on rating

sorted_df <- Bigdata[order(Bigdata$rating, decreasing = TRUE), ]
top_100_restaurants_rating <- head(sorted_df, 100)

##Creation worst 100 restaurants based on rating 

sorted_df1 <- Bigdata[order(Bigdata$rating), ]
worst_100_restaurants_rating <- head(sorted_df1, 100)


averages_df <- data.frame(
  Category = c("Best", "Worst"),
  AvgDistParking = c(mean(top_100_restaurants_rating$Distance_nearestparking), 
              mean(worst_100_restaurants_rating$Distance_nearestparking)),
  AvgDistToStop = c(mean(top_100_restaurants_rating$Distance_neareststop), 
                       mean(worst_100_restaurants_rating$Distance_neareststop)),
  AvgDistToTrain = c(mean(top_100_restaurants_rating$Distance_to_trainstation), 
                       mean(worst_100_restaurants_rating$Distance_to_trainstation)),
  AvgDistToJet = c(mean(top_100_restaurants_rating$Distance_to_jet), 
                       mean(worst_100_restaurants_rating$Distance_to_jet)),
  AvgDistToCatedral = c(mean(top_100_restaurants_rating$Distance_to_catedral), 
                       mean(worst_100_restaurants_rating$Distance_to_catedral)),
  AvgDistToPatek = c(mean(top_100_restaurants_rating$Distance_to_patekmuseum), 
                       mean(worst_100_restaurants_rating$Distance_to_patekmuseum)),
  AvgDistToBotanic = c(mean(top_100_restaurants_rating$Distance_to_botanicgarden), 
                       mean(worst_100_restaurants_rating$Distance_to_botanicgarden)),
  AvgDistToONU = c(mean(top_100_restaurants_rating$Distance_to_nationpalace), 
                       mean(worst_100_restaurants_rating$Distance_to_nationpalace)),
  AvgDistToBrokenchair = c(mean(top_100_restaurants_rating$Distance_to_brokenchair), 
                       mean(worst_100_restaurants_rating$Distance_to_brokenchair))
)
averages_long <- tidyr::gather(averages_df, key = "DistanceType", value = "AverageDistance", -Category)

##barplot with the distance but the restaurant are ranked according to their rating
bar_plot <- ggplot(averages_long, aes(x = DistanceType, y = AverageDistance, fill = Category)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  labs(title = "Average Distance to Location - Best vs Worst Restaurants",
       x = "Distance Type",
       y = "Average Distance") +
  scale_fill_manual(values = c("Best" = "green", "Worst" = "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(bar_plot)


We wanted to know whether there is a significant difference in the distance to the parking, the public transport stop, or the train station between the best and worst restaurants. As we can see, there is no significant difference.
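The "no significant difference" claim can be checked formally with a two-sample t-test; the sketch below shows the idea on synthetic vectors, since running it for real assumes the `top_100_restaurants_rating` / `worst_100_restaurants_rating` objects built above:

```r
# With the project objects this would be, for example:
# t.test(top_100_restaurants_rating$Distance_nearestparking,
#        worst_100_restaurants_rating$Distance_nearestparking)

# Synthetic illustration: two groups whose means differ by only 5 units
best  <- seq(100, 500, length.out = 100)
worst <- best + 5

res <- t.test(best, worst)  # Welch two-sample t-test
res$p.value                 # well above 0.05: no evidence of a difference
```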

Code
##Creation top 100 restaurants based on number of reviews

sorted_df1.1 <- Bigdata[order(Bigdata$Number_of_reviews, decreasing = TRUE), ]
top_100_restaurants_reviews <- head(sorted_df1.1, 100)

##Creation worst 100 restaurants based on number of reviews

sorted_df1.2 <- Bigdata[order(Bigdata$Number_of_reviews), ]
worst_100_restaurants_reviews <- head(sorted_df1.2, 100)


averages_df1 <- data.frame(
  Category = c("Highest", "Lowest"),
  AvgDistParking = c(mean(top_100_restaurants_reviews$Distance_nearestparking), 
              mean(worst_100_restaurants_reviews$Distance_nearestparking)),
  AvgDistToStop = c(mean(top_100_restaurants_reviews$Distance_neareststop), 
                       mean(worst_100_restaurants_reviews$Distance_neareststop)),
  AvgDistToTrain = c(mean(top_100_restaurants_reviews$Distance_to_trainstation), 
                       mean(worst_100_restaurants_reviews$Distance_to_trainstation)),
  AvgDistToJet = c(mean(top_100_restaurants_reviews$Distance_to_jet), 
                       mean(worst_100_restaurants_reviews$Distance_to_jet)),
  AvgDistToCatedral = c(mean(top_100_restaurants_reviews$Distance_to_catedral), 
                       mean(worst_100_restaurants_reviews$Distance_to_catedral)),
  AvgDistToPatek = c(mean(top_100_restaurants_reviews$Distance_to_patekmuseum), 
                       mean(worst_100_restaurants_reviews$Distance_to_patekmuseum)),
  AvgDistToBotanic = c(mean(top_100_restaurants_reviews$Distance_to_botanicgarden), 
                       mean(worst_100_restaurants_reviews$Distance_to_botanicgarden)),
  AvgDistToONU = c(mean(top_100_restaurants_reviews$Distance_to_nationpalace), 
                       mean(worst_100_restaurants_reviews$Distance_to_nationpalace)),
  AvgDistToBrokenchair = c(mean(top_100_restaurants_reviews$Distance_to_brokenchair), 
                       mean(worst_100_restaurants_reviews$Distance_to_brokenchair))
)

averages_long1 <- tidyr::gather(averages_df1, key = "DistanceType", value = "AverageDistance", -Category)


##barplot with the distances but the restaurants are ranked according to their number of reviews
bar_plot1 <- ggplot(averages_long1, aes(x = DistanceType, y = AverageDistance, fill = Category)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  labs(title = "Average Distance to Location - High vs Low Restaurants Reviews",
       x = "Distance Type",
       y = "Average Distance") +
  scale_fill_manual(values = c("Highest" = "green", "Lowest" = "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(bar_plot1)


We can see that the restaurants with the lowest number of reviews tend to be farther from the Botanic garden, the ONU Palace, and the train station.

3.2 Factor Analysis

First, we built a PCA graph with all the distance variables we had. The goal was to compare, after the Principal Component Analysis, which variables contribute in the same way to a specific dimension.

Code
##Factor analysis with all distances (or just focusing on one) interesting

data_for_fa <- Bigdata %>%
                dplyr::select(Distance_to_trainstation, Distance_nearestparking,Distance_neareststop,
                             Distance_to_jet,Distance_to_catedral,Distance_to_patekmuseum,
                             Distance_to_botanicgarden, Distance_to_nationpalace, Distance_to_brokenchair)

myPCA <- FactoMineR::PCA(data_for_fa)

Code
#fviz_pca_ind(myPCA,
             #geom.ind = "point",
             #col.ind = "cos2",
             #palette = "jco",
             #addEllipses = TRUE,
             #ellipse.type = "confidence",
             #repel = TRUE)
Code
# Principal component with all the distances variables

vectordistances <- c("Distance_to_trainstation", "Distance_nearestparking","Distance_neareststop",
"Distance_to_jet","Distance_to_catedral","Distance_to_patekmuseum","Distance_to_botanicgarden", "Distance_to_nationpalace", "Distance_to_brokenchair")

distances.pc <- prcomp(Bigdata[,vectordistances])
#summary(distances.pc)
#distances.pc$x

fviz_eig(distances.pc, geom="line")

Code
##Save the component in our df. Based on what we saw, we could rename by the distance that we really have
#Bigdata$distances1 <- distances.pc$x[,1]
#Bigdata$distances2 <- distances.pc$x[,2]
#Bigdata$distances3 <- distances.pc$x[,3]

The Principal Component Analysis (PCA) of the distance-related features reveals insightful patterns in the data. The scree plot illustrates the variance captured by each principal component (PC): the first three PCs contribute most of the explained variability, and the eigenvalues decline sharply after the third component, suggesting diminishing returns in explanatory power beyond this point.
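The proportions behind the scree plot can be read directly from the `prcomp` object; here is a sketch on synthetic data (the real call uses `Bigdata[, vectordistances]`):

```r
set.seed(42)
# Synthetic stand-in for the distance matrix: 3 high-variance columns + 6 noise columns
X <- matrix(rnorm(200 * 9), ncol = 9)
X[, 1:3] <- X[, 1:3] * 5

pc <- prcomp(X)
explained <- pc$sdev^2 / sum(pc$sdev^2)  # proportion of variance per PC
cumsum(explained)[3]                     # share captured by the first three PCs
```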

Code
var <- get_pca_var(distances.pc)
a<-fviz_contrib(distances.pc, "var",axes = 1)
b<-fviz_contrib(distances.pc, "var",axes = 2)
c<-fviz_contrib(distances.pc, "var",axes = 3)
grid.arrange(a,b,c,top='Contribution to the Principal Components')

As we can see on the graph above, the variables “Distance_to_botanicgarden”, “Distance_to_nationpalace” and “Distance_to_brokenchair” all contribute strongly to PC1, suggesting that they vary together along this component. In order to reduce dimensionality, we chose one of these variables to represent the overall theme of the group: “Distance_to_nationpalace”.

In the same way, we keep only “Distance_to_catedral” for dimension 2 and “Distance_to_jet” for dimension 3.

Code
library(plotly)
library(scatterplot3d) # for PCA analysis

# Assuming distances.pc is the result of a PCA
# Perform PCA analysis here if not already done
# pca_result <- prcomp(your_data, scale. = TRUE)
# distances.pc <- pca_result

# Create a dataframe for the scatter plot of PCA scores
pca_scores <- data.frame(distances.pc$x[, 1:3])
names(pca_scores) <- c("PC1", "PC2", "PC3")

# Create a dataframe for the arrows
arrows <- data.frame(
  x = rep(0, nrow(distances.pc$rotation)),
  y = rep(0, nrow(distances.pc$rotation)),
  z = rep(0, nrow(distances.pc$rotation)),
  u = distances.pc$rotation[, 1],
  v = distances.pc$rotation[, 2],
  w = distances.pc$rotation[, 3]
)

# First plot the PCA scores
p <- plot_ly(data = pca_scores, x = ~PC1, y = ~PC2, z = ~PC3, type = 'scatter3d', mode = 'markers',
             marker = list(size = 2, color = 'blue')) %>%
  add_markers()

# Then add the arrows for each principal component loading
for(i in 1:nrow(arrows)) {
  p <- p %>% add_trace(
    type = "cone",
    x = c(0, arrows$x[i]),
    y = c(0, arrows$y[i]),
    z = c(0, arrows$z[i]),
    u = c(0, arrows$u[i]),
    v = c(0, arrows$v[i]),
    w = c(0, arrows$w[i]),
    anchor = "tail",
    showscale = FALSE,
    sizemode = "absolute",
    sizeref = 0.1,
    opacity = 0.6
  )
}

# Finalize the layout
p <- p %>% layout(
  scene = list(
    xaxis = list(title = 'Distance ONU'),
    yaxis = list(title = 'Distance Catedral'),
    zaxis = list(title = 'Distance Jet'),
    aspectmode = 'cube'
  ),
  title = "3D PCA Visualization"
)

# Show the plot
p


In the PCA plot of the variables, we see that the variables Distance_to_botanicgarden, Distance_to_brokenchair and Distance_to_nationpalace are correlated, as well as the variables Distance_neareststop, Distance_to_catedral and Distance_nearestparking.

In the graph representing the 3 clusters, dimension 1 is an average between the correlated variables Distance_to_botanicgarden, Distance_to_brokenchair and Distance_to_nationpalace; and dimension 2 is an average between the correlated variables Distance_neareststop, Distance_to_catedral and Distance_nearestparking.

Code
##Factor analysis with all distances (or just focusing on one) interesting

data_for_fa <- Bigdata %>%
                dplyr::select(Distance_to_trainstation, Distance_nearestparking,Distance_neareststop,Distance_to_jet,Distance_to_catedral, Distance_to_nationpalace)

myPCA <- FactoMineR::PCA(data_for_fa)

Code
#fviz_pca_ind(myPCA,
             #geom.ind = "point",
             #col.ind = "cos2",
             #palette = "jco",
             #addEllipses = TRUE,
             #ellipse.type = "confidence",
             #repel = TRUE)

The PCA (Principal Component Analysis) graph visualizes the relative importance and contribution of the distance variables to the first two principal components. Dimension 1 (Dim 1), on the x-axis, explains 40.75% of the variance and is strongly influenced by ‘Distance_to_jet’, ‘Distance_nearestparking’, and ‘Distance_neareststop’, suggesting these variables are correlated and may represent a similar aspect of the data. Dimension 2 (Dim 2), on the y-axis, accounts for 29.35% of the variance and is most influenced by ‘Distance_to_nationpalace’ and ‘Distance_to_trainstation’, indicating these are distinct factors that contribute differently to the dataset’s variance. The contrast with the first PCA above is clearly visible.

3.3 Formation of clusters

Code
##Cluster according to all the distances that we have 
data_for_cluster <- Bigdata %>%
                dplyr::select(Distance_to_trainstation, Distance_nearestparking,Distance_neareststop,Distance_to_jet,Distance_to_catedral, Distance_to_nationpalace)

scaled_features <- scale(data_for_cluster)

result <- NbClust(scaled_features, distance = "euclidean", method = "kmeans", min.nc = 2, max.nc = 10, index = "all")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 6 proposed 2 as the best number of clusters 
* 8 proposed 3 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 6 proposed 6 as the best number of clusters 
* 1 proposed 9 as the best number of clusters 
* 2 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
******************************************************************* 
Code
## 3 clusters 

kmeans_result <- kmeans(scaled_features, centers = 3, nstart = 25)

fviz_cluster_object <- fviz_cluster(kmeans_result, data = scaled_features,
                                    repel = TRUE, # To avoid text overlapping
                                    show.clust.cent = TRUE,
                                    palette = c("#AC0C0C","#FFBF00", "#007000"), 
                                    geom = "point", # Removed "text" to avoid clutter
                                    ellipse.type = "convex", 
                                    ggtheme = theme_bw()) +
                      geom_point(size = 2, alpha = 0) # Adjust size and transparency

# Now you can adjust the scale manually by changing the limits if needed
fviz_cluster_object <- fviz_cluster_object + 
                      xlim(c(-13, 5)) + 
                      ylim(c(-6, 5))

# Print the plot
print(fviz_cluster_object)

Here are the k-means centers for each cluster, depending on the variables chosen for the clustering:

Code
result_table <- kmeans_result$centers

kable(result_table, "html") %>%
  kable_styling(full_width = FALSE)
Distance_to_trainstation Distance_nearestparking Distance_neareststop Distance_to_jet Distance_to_catedral Distance_to_nationpalace
-1.0055810 -0.2607063 0.015542 -0.2918398 0.2663589 -0.9165445
0.5213387 -0.0075770 -0.178913 -0.0211579 -0.4212910 0.6570413
1.5411635 2.0818411 1.989832 2.4798509 2.9294726 -0.7386820

:::
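Since the clustering was run on standardized features, these centers are expressed in z-score units: a negative value means the cluster sits below the city-wide average for that distance, a positive value above it. To read the centers in the original units, the scaling can be inverted; a minimal sketch on toy data (the names `features` and its two toy columns are illustrative, since `scaled_features` is built earlier in the pipeline):

```r
# Toy sketch: converting standardized k-means centers back to original units
set.seed(1)
features <- data.frame(a = rnorm(30, mean = 100, sd = 10),
                       b = rnorm(30, mean = 5, sd = 2))
scaled <- scale(features)                # same preprocessing as scaled_features
km <- kmeans(scaled, centers = 2, nstart = 10)

# Undo the z-score transform: multiply by the stored sd, then add back the mean
orig_centers <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), `*`)
orig_centers <- sweep(orig_centers, 2, attr(scaled, "scaled:center"), `+`)
```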

3.4 Regression Tree

3.4.1 Regression Tree with Number of reviews

Code
## Based on our dataset Bigdata, which contains all the variables we want to include in our model

### Tree built with the variable Number_of_reviews
Bigdata1 <- Bigdata %>% dplyr::select(-c(photoCount, rating, Distance_to_patekmuseum, Distance_to_botanicgarden, rawRanking))

set.seed(123)
indices <- Bigdata1$Number_of_reviews %>%
  as.character() %>% 
  createDataPartition(
    p = 0.8, 
    list = FALSE)

train = Bigdata1[indices,]
validation = Bigdata1[-indices,]


Dtree1 = rpart(Number_of_reviews ~ ., 
               data = train, 
               control = rpart.control(cp = 0.01, xval = 10))
# Note: parms = list(split = "gini") only applies to classification trees;
# with a numeric response, rpart fits an ANOVA regression tree.
#summary(Dtree1)

rpart.plot(Dtree1)
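The `cp = 0.01` setting caps how far the tree grows, and `xval = 10` stores 10-fold cross-validated errors in the cp table. A common follow-up, sketched here on the built-in `mtcars` data rather than `Bigdata`, is to pick the cp value that minimizes the cross-validated error and prune the tree back to it:

```r
library(rpart)

# Regression tree on toy data; xval = 10 populates cptable with CV errors
fit <- rpart(mpg ~ ., data = datasets::mtcars,
             control = rpart.control(cp = 0.01, xval = 10))

# Keep the cp with the smallest cross-validated error, then prune to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```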

3.4.2 Regression Tree with rating

Code
## We split our data excluding the variable Number_of_reviews. Tree built with rating
Bigdata2 <- Bigdata %>% dplyr::select(-c(Number_of_reviews, photoCount, Distance_to_patekmuseum, Distance_to_botanicgarden, rawRanking))

set.seed(123)

indices <- Bigdata2$rating %>%
  as.character() %>% 
  createDataPartition(
    p = 0.8, 
    list = FALSE)

train = Bigdata2[indices,]
validation = Bigdata2[-indices,]


# Note: the split = "gini" / "information" options only matter for
# classification trees; for the numeric response rating, rpart fits an
# ANOVA regression tree.
Dtree2 = rpart(rating ~ ., 
               data = train, 
               control = rpart.control(cp = 0.01, xval = 10))

rpart.plot(Dtree2, extra = 101, under = TRUE, cex = 0.8)

3.5 Multiple Regression

3.5.1 Number of Reviews

Code
corr_matrixreviews <-
 Bigdata %>% cor(use = "complete.obs") %>% round(digits = 4)

corr_matrixreviews <- corr_matrixreviews %>% knitr::kable(caption = "Correlation matrix",
                      align = 'c',
                      digits = 3) %>%
  kableExtra:: kable_styling(c("striped", "bordered"),
                full_width = FALSE,
                position = "center")

corr_matrixreviews

We estimated the following regression, which includes all the variables presented above, to model the number of reviews:

Number of Reviews = \beta_0 + \beta_1*rating + \beta_2*photoCount + \beta_3*rankingPosition + \\ \beta_4*OpenedHours + \beta_5*ScoreCompetition + \beta_6*AveragedPrice + \beta_7*French + \\ \beta_8*Italian + \beta_9*European + \beta_{10}*Vegetarian + \beta_{11}*Vegan + \beta_{12}*Mediterranean \\ + \beta_{13}*Asian + \beta_{14}*GlutenFree + \beta_{15}*Spanish + \beta_{16}*Swiss + \beta_{17}*Lunch \\ + \beta_{18}*Dinner + \beta_{19}*Drinks + \beta_{20}*Brunch + \beta_{21}*Breakfast + \beta_{22}*LateNightDrinks \\ + \beta_{23}*log(DistanceTrainStation) + \beta_{24}*log(DistanceNearestParking) + \\ \beta_{25}*log(DistanceNearestStop) + \beta_{26}*log(DistanceJet) + \beta_{27}*log(DistanceCathedral) \\ + \beta_{28}*log(DistanceNationPalace)

Code
completemodel <- lm(Number_of_reviews ~ rating + photoCount + rankingPosition + OpenedHours + averaged_score_competition+ averaged_price+ French + Italian + European +Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +Drinks + Brunch + Breakfast + Late_Night_Drinks +log(Distance_to_trainstation)+ log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet)+ log(Distance_to_catedral) + log(Distance_to_nationpalace),Bigdata)
Code
summary(completemodel)

Call:
lm(formula = Number_of_reviews ~ rating + photoCount + rankingPosition + 
    OpenedHours + averaged_score_competition + averaged_price + 
    French + Italian + European + Vegetarian + Vegan + Mediterranean + 
    Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner + 
    Drinks + Brunch + Breakfast + Late_Night_Drinks + log(Distance_to_trainstation) + 
    log(Distance_nearestparking) + log(Distance_neareststop) + 
    log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace), 
    data = Bigdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-913.56  -48.87   -3.01   35.83 1369.67 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    1.030e+03  2.217e+02   4.644 4.30e-06 ***
rating                        -1.114e+02  1.695e+01  -6.571 1.17e-10 ***
photoCount                     1.650e+00  7.267e-02  22.700  < 2e-16 ***
rankingPosition               -1.405e-01  3.761e-02  -3.735 0.000207 ***
OpenedHours                    4.361e+00  1.377e+00   3.167 0.001628 ** 
averaged_score_competition    -7.976e+00  3.507e+01  -0.227 0.820186    
averaged_price                 1.919e-03  4.067e-03   0.472 0.637162    
French                        -8.363e+00  1.666e+01  -0.502 0.615828    
Italian                        5.147e+00  1.808e+01   0.285 0.776008    
European                       8.624e+00  1.594e+01   0.541 0.588646    
Vegetarian                    -1.881e+01  1.485e+01  -1.267 0.205821    
Vegan                         -5.399e+00  1.691e+01  -0.319 0.749641    
Mediterranean                 -2.205e+01  1.821e+01  -1.211 0.226373    
Asian                          3.675e+00  1.919e+01   0.192 0.848184    
Gluten_free                   -5.093e+00  2.069e+01  -0.246 0.805692    
Spanish                        7.789e+00  3.848e+01   0.202 0.839688    
Swiss                          2.333e+00  2.056e+01   0.113 0.909718    
Lunch                          1.967e+01  2.392e+01   0.822 0.411157    
Dinner                        -2.195e+01  2.460e+01  -0.892 0.372638    
Drinks                        -5.093e+01  1.428e+01  -3.566 0.000395 ***
Brunch                         3.162e+01  2.299e+01   1.376 0.169501    
Breakfast                     -3.907e+01  1.919e+01  -2.036 0.042248 *  
Late_Night_Drinks              5.315e+01  1.990e+01   2.671 0.007800 ** 
log(Distance_to_trainstation)  3.698e+00  1.172e+01   0.316 0.752476    
log(Distance_nearestparking)  -1.102e+01  9.114e+00  -1.209 0.227226    
log(Distance_neareststop)     -9.324e+00  9.113e+00  -1.023 0.306684    
log(Distance_to_jet)           2.258e+01  1.531e+01   1.475 0.140761    
log(Distance_to_catedral)     -4.059e+01  1.282e+01  -3.167 0.001629 ** 
log(Distance_to_nationpalace) -3.218e+01  2.142e+01  -1.502 0.133716    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 142.7 on 543 degrees of freedom
Multiple R-squared:  0.6912,    Adjusted R-squared:  0.6753 
F-statistic: 43.41 on 28 and 543 DF,  p-value: < 2.2e-16
Code
#tab_model(completemodel)

The model has an overall good fit with a Multiple R-squared of 0.6912, indicating that approximately 69.12% of the variance in the number of reviews is explained by the included variables. The p-value (< 2.2e-16) of the F-statistic suggests that the model is statistically significant.
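For reference, the adjusted R-squared corrects the R-squared for the number of predictors p and the sample size n:

R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}

With R^2 = 0.6912, n = 572 and p = 28 (as implied by the 543 residual degrees of freedom above), this gives 1 - 0.3088 \times 571/543 \approx 0.6753, matching the reported adjusted R-squared.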


Then we estimated another regression, removing the rating and photoCount variables:

Code
#completemodel1 <- lm(Number_of_reviews ~ ., dplyr::select(Bigdata, -rating, -photoCount))

completemodel1 <- lm(Number_of_reviews ~rankingPosition + OpenedHours + averaged_score_competition+ averaged_price+ French + Italian + European +Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +Drinks + Brunch + Breakfast + Late_Night_Drinks +log(Distance_to_trainstation)+ log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet)+ log(Distance_to_catedral) + log(Distance_to_nationpalace),Bigdata) 

summary(completemodel1)

Call:
lm(formula = Number_of_reviews ~ rankingPosition + OpenedHours + 
    averaged_score_competition + averaged_price + French + Italian + 
    European + Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + 
    Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast + 
    Late_Night_Drinks + log(Distance_to_trainstation) + log(Distance_nearestparking) + 
    log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) + 
    log(Distance_to_nationpalace), data = Bigdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-360.15  -96.54  -18.05   52.84 1814.73 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    1.366e+03  3.033e+02   4.505 8.14e-06 ***
rankingPosition               -3.842e-01  4.806e-02  -7.994 7.85e-15 ***
OpenedHours                    7.100e+00  1.986e+00   3.575 0.000381 ***
averaged_score_competition    -1.262e+02  5.025e+01  -2.512 0.012291 *  
averaged_price                 1.515e-03  5.866e-03   0.258 0.796234    
French                         1.111e+01  2.409e+01   0.461 0.644769    
Italian                        8.644e+00  2.614e+01   0.331 0.741002    
European                       3.497e+01  2.282e+01   1.532 0.125997    
Vegetarian                     9.438e+00  2.105e+01   0.448 0.654040    
Vegan                         -5.271e+00  2.450e+01  -0.215 0.829749    
Mediterranean                 -2.787e+01  2.630e+01  -1.059 0.289842    
Asian                         -2.188e+01  2.764e+01  -0.791 0.429078    
Gluten_free                    1.249e+02  2.891e+01   4.321 1.85e-05 ***
Spanish                       -3.772e+01  5.557e+01  -0.679 0.497614    
Swiss                          1.538e+01  2.977e+01   0.516 0.605740    
Lunch                          1.478e+01  3.449e+01   0.428 0.668510    
Dinner                        -9.685e+00  3.563e+01  -0.272 0.785880    
Drinks                        -6.303e+01  2.065e+01  -3.052 0.002383 ** 
Brunch                         7.303e+01  3.311e+01   2.205 0.027848 *  
Breakfast                     -1.108e+02  2.746e+01  -4.035 6.25e-05 ***
Late_Night_Drinks              6.781e+01  2.879e+01   2.355 0.018876 *  
log(Distance_to_trainstation)  1.105e+01  1.698e+01   0.651 0.515380    
log(Distance_nearestparking)  -1.149e+01  1.320e+01  -0.870 0.384651    
log(Distance_neareststop)     -1.705e+01  1.320e+01  -1.292 0.197000    
log(Distance_to_jet)           2.282e+01  2.217e+01   1.029 0.303761    
log(Distance_to_catedral)     -5.663e+01  1.852e+01  -3.057 0.002342 ** 
log(Distance_to_nationpalace) -4.142e+01  3.104e+01  -1.335 0.182581    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 206.8 on 545 degrees of freedom
Multiple R-squared:  0.3491,    Adjusted R-squared:  0.318 
F-statistic: 11.24 on 26 and 545 DF,  p-value: < 2.2e-16

This model fits the data noticeably less well, with a Multiple R-squared of 0.3491 (adjusted R-squared of 0.318), indicating that approximately 34.9% of the variance in the number of reviews is explained by the included variables. The p-value (< 2.2e-16) of the F-statistic nevertheless indicates that the model as a whole is statistically significant.
We then applied backward elimination:

Code
### Backward elimination (AIC-based)

null_model <- lm(Number_of_reviews ~ 1, data = Bigdata)
final_model <- step(completemodel1, scope = list(lower = null_model, upper = completemodel), direction = "backward")
Code
summary(final_model)

Call:
lm(formula = Number_of_reviews ~ rankingPosition + OpenedHours + 
    averaged_score_competition + European + Gluten_free + Drinks + 
    Brunch + Breakfast + Late_Night_Drinks + log(Distance_to_catedral), 
    data = Bigdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-351.78  -94.79  -22.78   52.42 1861.68 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1075.3942   175.0917   6.142 1.55e-09 ***
rankingPosition              -0.3892     0.0451  -8.629  < 2e-16 ***
OpenedHours                   7.3458     1.8758   3.916 0.000101 ***
averaged_score_competition -125.5517    40.9100  -3.069 0.002252 ** 
European                     50.9877    17.7336   2.875 0.004191 ** 
Gluten_free                 121.7124    26.3883   4.612 4.94e-06 ***
Drinks                      -64.0061    20.2081  -3.167 0.001622 ** 
Brunch                       70.1870    32.3550   2.169 0.030481 *  
Breakfast                  -102.1132    26.2873  -3.885 0.000115 ***
Late_Night_Drinks            63.3349    27.8000   2.278 0.023088 *  
log(Distance_to_catedral)   -48.8540    13.5522  -3.605 0.000340 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 205.4 on 561 degrees of freedom
Multiple R-squared:  0.3393,    Adjusted R-squared:  0.3275 
F-statistic: 28.81 on 10 and 561 DF,  p-value: < 2.2e-16

By applying backward selection based on AIC, several variables are dropped. Our final regression is then:

Number of Reviews = \beta_0 + \beta_1*rankingPosition + \beta_2*OpenedHours + \beta_3*ScoreCompetition \\ + \beta_4*European + \beta_5*GlutenFree + \beta_6*Drinks + \beta_7*Brunch + \beta_8*Breakfast \\ + \beta_9*LateNightDrinks + \beta_{10}*log(DistanceCathedral)


We check whether there is a multicollinearity issue:

Code
olsrr::ols_vif_tol(final_model) %>% kableExtra::kable(digits = 3) %>% kableExtra::kable_styling(c("striped", "bordered")) %>% kableExtra::scroll_box(width = "100%", height = "300px")
Variables Tolerance VIF
rankingPosition 0.799 1.251
OpenedHours 0.861 1.161
averaged_score_competition 0.913 1.095
European 0.949 1.054
Gluten_free 0.845 1.183
Drinks 0.734 1.363
Brunch 0.883 1.132
Breakfast 0.791 1.265
Late_Night_Drinks 0.754 1.326
log(Distance_to_catedral) 0.918 1.090

No variable appears to pose a severe multicollinearity problem, since all VIF values are below 5.
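For context, tolerance and VIF come directly from regressing each predictor on all the others: tolerance is 1 − R²_j and VIF is its reciprocal. A minimal base-R sketch on the built-in `mtcars` data (a stand-in, since `Bigdata` is not reproduced here):

```r
# VIF for one predictor (wt), computed by hand
aux  <- lm(wt ~ hp + disp, data = datasets::mtcars)  # regress wt on the other predictors
r2_j <- summary(aux)$r.squared
tol  <- 1 - r2_j   # tolerance
vif  <- 1 / tol    # variance inflation factor; below 5 is usually considered safe
```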

Code
forecast::accuracy(final_model) %>% tibble::as_tibble() %>% dplyr::select(RMSE, MAE, MASE) %>%
  kableExtra::kable(caption = "Accuracy of the Linear Model", align = 'c') %>%
  kableExtra::kable_styling(c("striped", "bordered"),
                full_width = FALSE,
                position = "center")
Accuracy of the Linear Model
RMSE MAE MASE
203.3935 110.6714 0.8318808
Code
lindia::gg_qqplot(final_model)


An RMSE of 203 indicates that, on average, the model's predictions are off by approximately 203 reviews from the actual values, while an MAE of 111 means the typical absolute deviation is about 111 reviews. The gap between the two reflects a handful of large errors, which the RMSE penalizes more heavily.
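Both metrics are simple functions of the prediction errors; a toy base-R sketch with illustrative numbers (not the project data):

```r
actual    <- c(120, 80, 300, 45)   # hypothetical review counts
predicted <- c(100, 95, 260, 60)   # hypothetical model predictions
err  <- actual - predicted
rmse <- sqrt(mean(err^2))   # root mean squared error: penalizes large misses more
mae  <- mean(abs(err))      # mean absolute error: typical miss size
# rmse is approximately 24.75, mae is 22.5
```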

3.5.2 Rating

We estimated the following regression, which includes all the variables presented above, to model the rating:

Rating = \beta_0 + \beta_1*NumberOfReviews + \beta_2*photoCount + \beta_3*rankingPosition + \\ \beta_4*OpenedHours + \beta_5*ScoreCompetition + \beta_6*AveragedPrice + \beta_7*French + \\ \beta_8*Italian + \beta_9*European + \beta_{10}*Vegetarian + \beta_{11}*Vegan + \beta_{12}*Mediterranean \\ + \beta_{13}*Asian + \beta_{14}*GlutenFree + \beta_{15}*Spanish + \beta_{16}*Swiss + \beta_{17}*Lunch \\ + \beta_{18}*Dinner + \beta_{19}*Drinks + \beta_{20}*Brunch + \beta_{21}*Breakfast + \beta_{22}*LateNightDrinks \\ + \beta_{23}*log(DistanceTrainStation) + \beta_{24}*log(DistanceNearestParking) + \\ \beta_{25}*log(DistanceNearestStop) + \beta_{26}*log(DistanceJet) + \beta_{27}*log(DistanceCathedral) \\ + \beta_{28}*log(DistanceNationPalace)

Code
completemodelrating <- lm(rating ~ Number_of_reviews + photoCount + rankingPosition + OpenedHours + averaged_score_competition+ averaged_price+ French + Italian + European +Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +Drinks + Brunch + Breakfast + Late_Night_Drinks +log(Distance_to_trainstation)+ log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet)+ log(Distance_to_catedral) + log(Distance_to_nationpalace),Bigdata)
Code
summary(completemodelrating)

Call:
lm(formula = rating ~ Number_of_reviews + photoCount + rankingPosition + 
    OpenedHours + averaged_score_competition + averaged_price + 
    French + Italian + European + Vegetarian + Vegan + Mediterranean + 
    Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner + 
    Drinks + Brunch + Breakfast + Late_Night_Drinks + log(Distance_to_trainstation) + 
    log(Distance_nearestparking) + log(Distance_neareststop) + 
    log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace), 
    data = Bigdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85644 -0.23262 -0.03209  0.23060  1.04832 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    4.270e+00  5.195e-01   8.219 1.52e-15 ***
Number_of_reviews             -6.614e-04  1.006e-04  -6.571 1.17e-10 ***
photoCount                     6.267e-04  2.457e-04   2.550  0.01103 *  
rankingPosition               -7.213e-04  8.750e-05  -8.244 1.26e-15 ***
OpenedHours                   -2.899e-03  3.384e-03  -0.857  0.39194    
averaged_score_competition     1.781e-01  8.512e-02   2.093  0.03686 *  
averaged_price                 2.215e-05  9.865e-06   2.245  0.02514 *  
French                         2.194e-02  4.059e-02   0.540  0.58912    
Italian                        6.713e-02  4.396e-02   1.527  0.12738    
European                      -1.255e-01  3.847e-02  -3.262  0.00118 ** 
Vegetarian                    -1.811e-01  3.540e-02  -5.117 4.32e-07 ***
Vegan                         -1.935e-02  4.120e-02  -0.470  0.63882    
Mediterranean                 -8.771e-02  4.427e-02  -1.981  0.04807 *  
Asian                         -7.776e-02  4.664e-02  -1.667  0.09609 .  
Gluten_free                   -4.510e-03  5.043e-02  -0.089  0.92877    
Spanish                        1.721e-01  9.348e-02   1.841  0.06610 .  
Swiss                          2.990e-02  5.009e-02   0.597  0.55081    
Lunch                         -1.085e-01  5.813e-02  -1.866  0.06255 .  
Dinner                        -5.289e-02  5.995e-02  -0.882  0.37803    
Drinks                         1.799e-02  3.520e-02   0.511  0.60954    
Brunch                         9.421e-02  5.597e-02   1.683  0.09291 .  
Breakfast                      5.990e-02  4.687e-02   1.278  0.20184    
Late_Night_Drinks             -2.669e-02  4.880e-02  -0.547  0.58468    
log(Distance_to_trainstation)  7.645e-03  2.856e-02   0.268  0.78903    
log(Distance_nearestparking)   5.243e-03  2.224e-02   0.236  0.81369    
log(Distance_neareststop)     -9.053e-03  2.222e-02  -0.407  0.68392    
log(Distance_to_jet)          -1.133e-02  3.737e-02  -0.303  0.76184    
log(Distance_to_catedral)      2.186e-02  3.151e-02   0.694  0.48814    
log(Distance_to_nationpalace) -3.169e-02  5.230e-02  -0.606  0.54483    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3478 on 543 degrees of freedom
Multiple R-squared:  0.2596,    Adjusted R-squared:  0.2215 
F-statistic: 6.801 on 28 and 543 DF,  p-value: < 2.2e-16


The regression model was fitted to predict the rating of a restaurant based on various features. The model shows that the number of reviews, ranking position, and several other factors significantly influence the restaurant’s rating. The overall model has an adjusted R-squared value of 0.2215, indicating that the included variables explain about 22.15% of the variability in the restaurant ratings.
Then we estimated another regression, removing the Number of Reviews and photoCount variables:

Code
completemodelrating1 <- lm(rating ~rankingPosition + OpenedHours + averaged_score_competition+ averaged_price+ French + Italian + European +Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +Drinks + Brunch + Breakfast + Late_Night_Drinks +log(Distance_to_trainstation)+ log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet)+ log(Distance_to_catedral) + log(Distance_to_nationpalace),Bigdata) 
Code
summary(completemodelrating1)

Call:
lm(formula = rating ~ rankingPosition + OpenedHours + averaged_score_competition + 
    averaged_price + French + Italian + European + Vegetarian + 
    Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + 
    Lunch + Dinner + Drinks + Brunch + Breakfast + Late_Night_Drinks + 
    log(Distance_to_trainstation) + log(Distance_nearestparking) + 
    log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) + 
    log(Distance_to_nationpalace), data = Bigdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.00853 -0.23639 -0.03135  0.25110  1.05061 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    3.649e+00  5.325e-01   6.852 1.97e-11 ***
rankingPosition               -5.846e-04  8.438e-05  -6.928 1.21e-11 ***
OpenedHours                   -6.844e-03  3.487e-03  -1.963 0.050167 .  
averaged_score_competition     2.263e-01  8.823e-02   2.564 0.010603 *  
averaged_price                 2.192e-05  1.030e-05   2.129 0.033726 *  
French                         2.296e-02  4.230e-02   0.543 0.587506    
Italian                        6.551e-02  4.589e-02   1.427 0.154021    
European                      -1.447e-01  4.007e-02  -3.611 0.000333 ***
Vegetarian                    -1.844e-01  3.696e-02  -4.991 8.09e-07 ***
Vegan                         -1.651e-02  4.302e-02  -0.384 0.701264    
Mediterranean                 -7.465e-02  4.618e-02  -1.616 0.106592    
Asian                         -7.622e-02  4.854e-02  -1.570 0.116934    
Gluten_free                   -3.940e-02  5.076e-02  -0.776 0.438000    
Spanish                        1.877e-01  9.757e-02   1.924 0.054857 .  
Swiss                          2.578e-02  5.228e-02   0.493 0.622145    
Lunch                         -1.254e-01  6.056e-02  -2.071 0.038823 *  
Dinner                        -4.367e-02  6.257e-02  -0.698 0.485488    
Drinks                         5.751e-02  3.626e-02   1.586 0.113309    
Brunch                         6.436e-02  5.814e-02   1.107 0.268808    
Breakfast                      1.106e-01  4.822e-02   2.294 0.022182 *  
Late_Night_Drinks             -6.888e-02  5.056e-02  -1.362 0.173614    
log(Distance_to_trainstation)  3.269e-03  2.981e-02   0.110 0.912722    
log(Distance_nearestparking)   1.322e-02  2.318e-02   0.570 0.568708    
log(Distance_neareststop)     -7.442e-04  2.317e-02  -0.032 0.974390    
log(Distance_to_jet)          -2.750e-02  3.893e-02  -0.706 0.480286    
log(Distance_to_catedral)      5.557e-02  3.252e-02   1.709 0.088067 .  
log(Distance_to_nationpalace) -8.148e-03  5.450e-02  -0.150 0.881207    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3631 on 545 degrees of freedom
Multiple R-squared:  0.1897,    Adjusted R-squared:  0.151 
F-statistic: 4.907 on 26 and 545 DF,  p-value: 1.579e-13


Then, following the same structure, we applied a backward selection based on AIC:

Code
### Backward elimination (AIC-based)

null_model <- lm(rating ~ 1, data = Bigdata)
final_modelrating <- step(completemodelrating1, scope = list(lower = null_model, upper = completemodelrating), direction = "backward")
Code
summary(final_modelrating)

Call:
lm(formula = rating ~ rankingPosition + OpenedHours + averaged_score_competition + 
    averaged_price + European + Vegetarian + Mediterranean + 
    Asian + Spanish + Lunch + Drinks + Breakfast + Late_Night_Drinks + 
    log(Distance_to_catedral), data = Bigdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9674 -0.2421 -0.0352  0.2436  1.0119 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 3.565e+00  3.085e-01  11.557  < 2e-16 ***
rankingPosition            -5.450e-04  7.727e-05  -7.053 5.22e-12 ***
OpenedHours                -7.259e-03  3.410e-03  -2.129 0.033710 *  
averaged_score_competition  2.160e-01  7.206e-02   2.998 0.002841 ** 
averaged_price              2.242e-05  1.019e-05   2.200 0.028252 *  
European                   -1.229e-01  3.398e-02  -3.617 0.000325 ***
Vegetarian                 -1.865e-01  3.396e-02  -5.494 6.00e-08 ***
Mediterranean              -6.913e-02  4.168e-02  -1.658 0.097796 .  
Asian                      -1.021e-01  4.613e-02  -2.214 0.027200 *  
Spanish                     1.760e-01  9.578e-02   1.837 0.066672 .  
Lunch                      -1.253e-01  5.797e-02  -2.162 0.031070 *  
Drinks                      6.282e-02  3.542e-02   1.774 0.076643 .  
Breakfast                   1.296e-01  4.484e-02   2.891 0.003989 ** 
Late_Night_Drinks          -7.017e-02  4.939e-02  -1.421 0.155899    
log(Distance_to_catedral)   4.342e-02  2.398e-02   1.810 0.070779 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3609 on 557 degrees of freedom
Multiple R-squared:  0.1819,    Adjusted R-squared:  0.1614 
F-statistic: 8.847 on 14 and 557 DF,  p-value: < 2.2e-16

By applying backward selection based on AIC, several variables are dropped. Our final regression is then:

Rating = \beta_0 + \beta_1*rankingPosition + \beta_2*OpenedHours + \beta_3*ScoreCompetition \\ + \beta_4*AveragedPrice + \beta_5*European + \beta_6*Vegetarian + \beta_7*Mediterranean + \beta_8*Asian \\ + \beta_9*Spanish + \beta_{10}*Lunch + \beta_{11}*Drinks + \beta_{12}*Breakfast + \beta_{13}*LateNightDrinks \\ + \beta_{14}*log(DistanceCathedral)


We check whether there is a multicollinearity issue:

Code
olsrr::ols_vif_tol(final_modelrating) %>% kableExtra::kable(digits = 3) %>% kableExtra::kable_styling(c("striped", "bordered")) %>% kableExtra::scroll_box(width = "100%", height = "300px")
Variables Tolerance VIF
rankingPosition 0.841 1.189
OpenedHours 0.805 1.242
averaged_score_competition 0.909 1.100
averaged_price 0.984 1.017
European 0.798 1.253
Vegetarian 0.819 1.221
Mediterranean 0.895 1.118
Asian 0.714 1.400
Spanish 0.972 1.029
Lunch 0.850 1.177
Drinks 0.738 1.355
Breakfast 0.839 1.192
Late_Night_Drinks 0.738 1.355
log(Distance_to_catedral) 0.905 1.105


No variable appears to pose a severe multicollinearity problem, since all VIF values are below 5.

Code
forecast::accuracy(final_modelrating) %>% tibble::as_tibble() %>% dplyr::select(RMSE, MAE, MASE) %>%
  kableExtra::kable(caption = "Accuracy of the Linear Model", align = 'c') %>%
  kableExtra::kable_styling(c("striped", "bordered"),
                full_width = FALSE,
                position = "center")
Accuracy of the Linear Model
RMSE MAE MASE
0.3561674 0.288294 0.8448291
Code
lindia::gg_qqplot(final_modelrating)


An RMSE of 0.356 indicates that, on average, the model's predictions are off by approximately 0.356 rating points from the actual values, while an MAE of 0.288 means the typical absolute deviation is about 0.288 rating points.

3.5.3 Predictive modeling with Number of Reviews

Cross-validation is a robust technique for assessing the performance of a model by partitioning the dataset into K subsets, training on K−1 of them and validating on the remaining one in turn. This provides a more comprehensive evaluation, reducing the risk of overfitting or underfitting, and offers a more reliable estimate of the model's generalization performance on unseen data.

LOOCV

Code
#LOOCV
train.control <- trainControl(method = "LOOCV")

# Train the model
model_lo <- Bigdata %>% 
  train(Number_of_reviews ~ ., data = ., method = "lm", trControl = train.control)

# Summarize the results
#print(model_lo)
summary(model_lo)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-911.99  -47.45   -4.03   34.96 1416.47 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 5.304e+02  2.557e+02   2.074 0.038512 *  
rating                     -1.144e+02  1.794e+01  -6.379 3.85e-10 ***
photoCount                  1.640e+00  8.331e-02  19.683  < 2e-16 ***
rankingPosition            -1.341e-01  5.699e-02  -2.354 0.018929 *  
rawRanking                  1.729e+01  4.338e+01   0.399 0.690415    
Distance_to_trainstation    7.842e-03  4.760e-02   0.165 0.869195    
Distance_nearestparking    -4.097e-02  3.423e-02  -1.197 0.231864    
Distance_neareststop        2.768e-02  4.634e-02   0.597 0.550499    
Distance_to_jet             7.628e-02  4.871e-02   1.566 0.117928    
Distance_to_catedral       -1.429e-01  4.389e-02  -3.255 0.001206 ** 
Distance_to_patekmuseum     7.391e-02  3.320e-02   2.227 0.026393 *  
Distance_to_botanicgarden   5.584e-02  9.460e-02   0.590 0.555254    
Distance_to_nationpalace   -8.497e-02  2.153e-01  -0.395 0.693314    
Distance_to_brokenchair     1.205e-02  1.896e-01   0.064 0.949354    
averaged_score_competition -8.210e+00  4.014e+01  -0.205 0.838015    
French                     -1.024e+01  1.681e+01  -0.609 0.542960    
Italian                     4.867e+00  1.824e+01   0.267 0.789688    
European                    6.199e+00  1.610e+01   0.385 0.700314    
Vegetarian                 -1.647e+01  1.494e+01  -1.102 0.270779    
Vegan                      -7.922e+00  1.698e+01  -0.466 0.641099    
Mediterranean              -2.044e+01  1.825e+01  -1.120 0.263282    
Asian                       4.956e+00  1.929e+01   0.257 0.797287    
Gluten_free                -8.758e+00  2.101e+01  -0.417 0.677007    
Spanish                     5.991e+00  3.866e+01   0.155 0.876913    
Swiss                       4.395e+00  2.076e+01   0.212 0.832388    
Lunch                       1.545e+01  2.408e+01   0.642 0.521343    
Drinks                     -5.037e+01  1.435e+01  -3.509 0.000487 ***
Brunch                      3.278e+01  2.324e+01   1.411 0.158934    
Breakfast                  -3.708e+01  1.926e+01  -1.925 0.054797 .  
Dinner                     -2.317e+01  2.476e+01  -0.936 0.349744    
Late_Night_Drinks           5.161e+01  2.001e+01   2.579 0.010164 *  
averaged_price              1.809e-03  4.086e-03   0.443 0.658077    
OpenedHours                 4.543e+00  1.379e+00   3.294 0.001053 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 143.3 on 539 degrees of freedom
Multiple R-squared:  0.6911,    Adjusted R-squared:  0.6728 
F-statistic: 37.69 on 32 and 539 DF,  p-value: < 2.2e-16

K-Fold Cross Validation

Code
set.seed(123) 
train.control1 <- trainControl(method = "cv", number = 10)

# Train the model
model_k <- Bigdata %>% 
  train(Number_of_reviews ~ ., data = ., method = "lm", trControl = train.control1)

# Summarize the results
#print(model_k)
summary(model_k)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-911.99  -47.45   -4.03   34.96 1416.47 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 5.304e+02  2.557e+02   2.074 0.038512 *  
rating                     -1.144e+02  1.794e+01  -6.379 3.85e-10 ***
photoCount                  1.640e+00  8.331e-02  19.683  < 2e-16 ***
rankingPosition            -1.341e-01  5.699e-02  -2.354 0.018929 *  
rawRanking                  1.729e+01  4.338e+01   0.399 0.690415    
Distance_to_trainstation    7.842e-03  4.760e-02   0.165 0.869195    
Distance_nearestparking    -4.097e-02  3.423e-02  -1.197 0.231864    
Distance_neareststop        2.768e-02  4.634e-02   0.597 0.550499    
Distance_to_jet             7.628e-02  4.871e-02   1.566 0.117928    
Distance_to_catedral       -1.429e-01  4.389e-02  -3.255 0.001206 ** 
Distance_to_patekmuseum     7.391e-02  3.320e-02   2.227 0.026393 *  
Distance_to_botanicgarden   5.584e-02  9.460e-02   0.590 0.555254    
Distance_to_nationpalace   -8.497e-02  2.153e-01  -0.395 0.693314    
Distance_to_brokenchair     1.205e-02  1.896e-01   0.064 0.949354    
averaged_score_competition -8.210e+00  4.014e+01  -0.205 0.838015    
French                     -1.024e+01  1.681e+01  -0.609 0.542960    
Italian                     4.867e+00  1.824e+01   0.267 0.789688    
European                    6.199e+00  1.610e+01   0.385 0.700314    
Vegetarian                 -1.647e+01  1.494e+01  -1.102 0.270779    
Vegan                      -7.922e+00  1.698e+01  -0.466 0.641099    
Mediterranean              -2.044e+01  1.825e+01  -1.120 0.263282    
Asian                       4.956e+00  1.929e+01   0.257 0.797287    
Gluten_free                -8.758e+00  2.101e+01  -0.417 0.677007    
Spanish                     5.991e+00  3.866e+01   0.155 0.876913    
Swiss                       4.395e+00  2.076e+01   0.212 0.832388    
Lunch                       1.545e+01  2.408e+01   0.642 0.521343    
Drinks                     -5.037e+01  1.435e+01  -3.509 0.000487 ***
Brunch                      3.278e+01  2.324e+01   1.411 0.158934    
Breakfast                  -3.708e+01  1.926e+01  -1.925 0.054797 .  
Dinner                     -2.317e+01  2.476e+01  -0.936 0.349744    
Late_Night_Drinks           5.161e+01  2.001e+01   2.579 0.010164 *  
averaged_price              1.809e-03  4.086e-03   0.443 0.658077    
OpenedHours                 4.543e+00  1.379e+00   3.294 0.001053 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 143.3 on 539 degrees of freedom
Multiple R-squared:  0.6911,    Adjusted R-squared:  0.6728 
F-statistic: 37.69 on 32 and 539 DF,  p-value: < 2.2e-16
Code
## We obtain a better RMSE and R-squared with k-fold cross-validation than with LOOCV.

With 580 observations, we have a reasonably sized dataset for regression. Ten-fold cross-validation strikes a good balance between leaving enough data in each fold for training and validation and still providing a reliable estimate of model performance. Since the number of reviews ranges from 10 to 2,200, there is substantial variability in our target variable; a higher k (such as 10) helps ensure that the full range of the target is represented in both the training and validation sets across folds. We then compared the results of the two resampling methods, LOOCV (leave-one-out cross-validation) and k-fold cross-validation, with our original fit. Both cross-validated models show a lower adjusted R² than the linear model fitted on the full data, which also has a better RMSE. We conclude that our model suffers from some overfitting, since its out-of-sample performance is worse than its in-sample performance.
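The comparison described above can be sketched as follows. This is a minimal, illustrative snippet (the object names `ctrl_loocv`, `fit_loocv`, etc. are ours, not from the report) that fits the same linear model under both resampling schemes and prints the resampled RMSE and R² side by side, assuming `Bigdata` is loaded as before:

```r
# Illustrative sketch: comparing LOOCV and 10-fold CV estimates with caret.
library(caret)

set.seed(123)
ctrl_loocv <- trainControl(method = "LOOCV")
ctrl_kfold <- trainControl(method = "cv", number = 10)

fit_loocv <- train(Number_of_reviews ~ ., data = Bigdata,
                   method = "lm", trControl = ctrl_loocv)
fit_kfold <- train(Number_of_reviews ~ ., data = Bigdata,
                   method = "lm", trControl = ctrl_kfold)

# $results holds the resampled RMSE and Rsquared for each scheme; these can
# be compared with the in-sample values from summary(lm(...)) on the full data.
rbind(LOOCV = fit_loocv$results[, c("RMSE", "Rsquared")],
      KFold = fit_kfold$results[, c("RMSE", "Rsquared")])
```

A large gap between the in-sample R² and the resampled R² is the overfitting signal discussed in the text.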

4. Exploratory Analysis

4.1 Exploring Multiple Regression

4.1.1 Number of Reviews

Code
model3 <- lm(Number_of_reviews ~rankingPosition + OpenedHours + averaged_score_competition * averaged_price+ French + Italian + European +Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +Drinks + Brunch + Breakfast*log(Distance_to_trainstation) + Late_Night_Drinks + log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet)+ log(Distance_to_catedral) + log(Distance_to_nationpalace),Bigdata)
Code
summary(model3)

Call:
lm(formula = Number_of_reviews ~ rankingPosition + OpenedHours + 
    averaged_score_competition * averaged_price + French + Italian + 
    European + Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + 
    Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast * 
    log(Distance_to_trainstation) + Late_Night_Drinks + log(Distance_nearestparking) + 
    log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) + 
    log(Distance_to_nationpalace), data = Bigdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-342.85  -97.43  -18.83   52.74 1823.91 

Coefficients:
                                            Estimate Std. Error t value
(Intercept)                               1250.85230  350.83401   3.565
rankingPosition                             -0.38406    0.04811  -7.982
OpenedHours                                  6.85513    1.99492   3.436
averaged_score_competition                 -95.43806   65.60948  -1.455
averaged_price                               1.60597    2.06135   0.779
French                                       9.08837   24.12488   0.377
Italian                                      7.19121   26.15266   0.275
European                                    38.50535   22.94541   1.678
Vegetarian                                   9.88746   21.05787   0.470
Vegan                                       -7.08545   24.53101  -0.289
Mediterranean                              -26.21663   26.32632  -0.996
Asian                                      -20.93725   27.64762  -0.757
Gluten_free                                123.97509   28.95646   4.281
Spanish                                    -42.99777   56.10955  -0.766
Swiss                                       17.32762   29.80252   0.581
Lunch                                       15.30256   34.49388   0.444
Dinner                                      -1.15438   36.09453  -0.032
Drinks                                     -61.05951   20.68942  -2.951
Brunch                                      75.73020   33.15977   2.284
Breakfast                                 -409.66118  244.46837  -1.676
log(Distance_to_trainstation)                4.36497   17.68657   0.247
Late_Night_Drinks                           69.11481   28.82954   2.397
log(Distance_nearestparking)               -10.64925   13.22015  -0.806
log(Distance_neareststop)                  -17.27712   13.19577  -1.309
log(Distance_to_jet)                        21.61045   22.18217   0.974
log(Distance_to_catedral)                  -56.14928   18.55688  -3.026
log(Distance_to_nationpalace)              -38.50518   31.10007  -1.238
averaged_score_competition:averaged_price   -0.37752    0.48504  -0.778
Breakfast:log(Distance_to_trainstation)     44.16263   35.80339   1.233
                                          Pr(>|t|)    
(Intercept)                               0.000395 ***
rankingPosition                           8.59e-15 ***
OpenedHours                               0.000635 ***
averaged_score_competition                0.146347    
averaged_price                            0.436267    
French                                    0.706527    
Italian                                   0.783444    
European                                  0.093897 .  
Vegetarian                                0.638874    
Vegan                                     0.772817    
Mediterranean                             0.319775    
Asian                                     0.449205    
Gluten_free                               2.20e-05 ***
Spanish                                   0.443820    
Swiss                                     0.561202    
Lunch                                     0.657486    
Dinner                                    0.974498    
Drinks                                    0.003302 ** 
Brunch                                    0.022769 *  
Breakfast                                 0.094368 .  
log(Distance_to_trainstation)             0.805159    
Late_Night_Drinks                         0.016850 *  
log(Distance_nearestparking)              0.420865    
log(Distance_neareststop)                 0.190989    
log(Distance_to_jet)                      0.330378    
log(Distance_to_catedral)                 0.002597 ** 
log(Distance_to_nationpalace)             0.216212    
averaged_score_competition:averaged_price 0.436722    
Breakfast:log(Distance_to_trainstation)   0.217932    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 206.8 on 543 degrees of freedom
Multiple R-squared:  0.3517,    Adjusted R-squared:  0.3182 
F-statistic: 10.52 on 28 and 543 DF,  p-value: < 2.2e-16

We intended to interact the averaged score of the competition with the averaged price of a restaurant. In addition, we tried to capture a differential effect by interacting Breakfast with the distance to the train station, assuming a higher flow of people in the morning for restaurants close to the station.

Code
bigmodel2 <- Bigdata %>% 
  lm(Number_of_reviews~ Distance_to_trainstation +Distance_nearestparking + Distance_neareststop + Distance_to_jet
     + Distance_to_catedral +Distance_to_patekmuseum +Distance_to_botanicgarden + Distance_to_nationpalace +
       Distance_to_brokenchair,.) 
Code
summary(bigmodel2)

Call:
lm(formula = Number_of_reviews ~ Distance_to_trainstation + Distance_nearestparking + 
    Distance_neareststop + Distance_to_jet + Distance_to_catedral + 
    Distance_to_patekmuseum + Distance_to_botanicgarden + Distance_to_nationpalace + 
    Distance_to_brokenchair, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-259.76 -112.85  -54.05   26.71 2074.40 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)   
(Intercept)               168.16440  126.98306   1.324  0.18594   
Distance_to_trainstation   -0.11278    0.06722  -1.678  0.09392 . 
Distance_nearestparking     0.01925    0.05782   0.333  0.73932   
Distance_neareststop       -0.09056    0.07765  -1.166  0.24401   
Distance_to_jet             0.11402    0.08072   1.413  0.15836   
Distance_to_catedral       -0.21738    0.07112  -3.057  0.00235 **
Distance_to_patekmuseum     0.14825    0.04888   3.033  0.00253 **
Distance_to_botanicgarden   0.18644    0.15451   1.207  0.22807   
Distance_to_nationpalace   -0.50554    0.34353  -1.472  0.14169   
Distance_to_brokenchair     0.35832    0.30271   1.184  0.23702   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 245.8 on 562 degrees of freedom
Multiple R-squared:  0.05202,   Adjusted R-squared:  0.03684 
F-statistic: 3.427 on 9 and 562 DF,  p-value: 0.0004039

When building our different models, we started with all the distance variables and saw that some of them were significant. As we added variables step by step, the distance variables gradually lost their significance. We therefore wanted to fit a specific model based on them alone.

Code
bigmodel2.2 <- Bigdata %>% 
  lm(Number_of_reviews~ French + Italian + European +
       Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss,.) 
Code
summary(bigmodel2.2)

Call:
lm(formula = Number_of_reviews ~ French + Italian + European + 
    Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + 
    Spanish + Swiss, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-405.28  -91.24  -39.51   23.41 2200.34 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     61.034     20.574   2.967  0.00314 ** 
French          23.295     26.635   0.875  0.38216    
Italian         -1.905     28.660  -0.066  0.94703    
European        69.330     24.848   2.790  0.00545 ** 
Vegetarian      54.524     22.844   2.387  0.01733 *  
Vegan           19.976     26.909   0.742  0.45818    
Mediterranean  -18.783     29.082  -0.646  0.51862    
Asian          -38.975     29.513  -1.321  0.18717    
Gluten_free    204.040     30.958   6.591 1.01e-10 ***
Spanish        -50.739     61.589  -0.824  0.41039    
Swiss           23.086     32.875   0.702  0.48283    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 231.6 on 561 degrees of freedom
Multiple R-squared:  0.1599,    Adjusted R-squared:  0.1449 
F-statistic: 10.68 on 10 and 561 DF,  p-value: < 2.2e-16

Following the same logic, we then isolated each category of variables to assess their impact individually. In that way, we built the model above based on the cuisine variables.

Code
##With some interaction 

bigmodel2.2 <- Bigdata %>% 
  lm(Number_of_reviews~ French + Italian*Dinner + European +
       Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss,.) 
#summary(bigmodel2.2)

bigmodel2.3 <- Bigdata %>% 
  lm(Number_of_reviews~ Italian*Dinner + European*French +
       Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss,.) 

#summary(bigmodel2.3)

Here we included interactions to see whether a different impact would be observed: we interacted Italian with Dinner, and European with French.

Code
bigmodel4 <- Bigdata %>% 
  lm(Number_of_reviews ~rankingPosition + OpenedHours + averaged_score_competition+ French + Italian + European +Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +Drinks + Brunch + Breakfast + Late_Night_Drinks +log(Distance_to_trainstation)+ log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet)+ log(Distance_to_catedral) + log(Distance_to_nationpalace),.) 

summary(bigmodel4)

Call:
lm(formula = Number_of_reviews ~ rankingPosition + OpenedHours + 
    averaged_score_competition + French + Italian + European + 
    Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + 
    Spanish + Swiss + Lunch + Dinner + Drinks + Brunch + Breakfast + 
    Late_Night_Drinks + log(Distance_to_trainstation) + log(Distance_nearestparking) + 
    log(Distance_neareststop) + log(Distance_to_jet) + log(Distance_to_catedral) + 
    log(Distance_to_nationpalace), data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-359.72  -96.22  -18.38   52.39 1814.80 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   1366.90523  302.98678   4.511 7.89e-06 ***
rankingPosition                 -0.38380    0.04799  -7.997 7.67e-15 ***
OpenedHours                      7.09796    1.98408   3.577 0.000378 ***
averaged_score_competition    -126.30324   50.20428  -2.516 0.012163 *  
French                          11.09698   24.06812   0.461 0.644935    
Italian                          8.63323   26.11518   0.331 0.741086    
European                        34.98067   22.80353   1.534 0.125608    
Vegetarian                       9.16936   21.00376   0.437 0.662604    
Vegan                           -5.36153   24.48033  -0.219 0.826721    
Mediterranean                  -27.77958   26.27677  -1.057 0.290892    
Asian                          -21.28139   27.52496  -0.773 0.439758    
Gluten_free                    124.95499   28.88586   4.326 1.81e-05 ***
Spanish                        -37.85744   55.52034  -0.682 0.495613    
Swiss                           15.40887   29.74699   0.518 0.604670    
Lunch                           15.00397   34.45116   0.436 0.663361    
Dinner                          -9.65566   35.60394  -0.271 0.786343    
Drinks                         -63.23221   20.61695  -3.067 0.002269 ** 
Brunch                          73.02830   33.08638   2.207 0.027715 *  
Breakfast                     -110.84424   27.43846  -4.040 6.12e-05 ***
Late_Night_Drinks               67.86674   28.76772   2.359 0.018669 *  
log(Distance_to_trainstation)   11.21674   16.94930   0.662 0.508390    
log(Distance_nearestparking)   -11.47226   13.19184  -0.870 0.384875    
log(Distance_neareststop)      -16.99712   13.18428  -1.289 0.197876    
log(Distance_to_jet)            22.34001   22.07307   1.012 0.311942    
log(Distance_to_catedral)      -56.49002   18.49767  -3.054 0.002369 ** 
log(Distance_to_nationpalace)  -41.35904   31.01229  -1.334 0.182880    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 206.6 on 546 degrees of freedom
Multiple R-squared:  0.349, Adjusted R-squared:  0.3192 
F-statistic: 11.71 on 25 and 546 DF,  p-value: < 2.2e-16

4.1.2 Rating

As we saw in the general model including all the variables, the distance variables have no impact on rating. In this fourth part of our report, we decided to omit those variables, which would not bring any relevant information to the regression. Nevertheless, we decided to try a regression based on the different mealTypes.

Code
model4 <- lm(rating ~rankingPosition + OpenedHours + averaged_score_competition * averaged_price+ French + Italian + European +Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + Lunch + Dinner +Drinks + Brunch + Breakfast*log(Distance_to_trainstation) + Late_Night_Drinks + log(Distance_nearestparking) + log(Distance_neareststop) + log(Distance_to_jet)+ log(Distance_to_catedral) + log(Distance_to_nationpalace),Bigdata)
Code
summary(model4)

Call:
lm(formula = rating ~ rankingPosition + OpenedHours + averaged_score_competition * 
    averaged_price + French + Italian + European + Vegetarian + 
    Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss + 
    Lunch + Dinner + Drinks + Brunch + Breakfast * log(Distance_to_trainstation) + 
    Late_Night_Drinks + log(Distance_nearestparking) + log(Distance_neareststop) + 
    log(Distance_to_jet) + log(Distance_to_catedral) + log(Distance_to_nationpalace), 
    data = Bigdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.99882 -0.24104 -0.02906  0.24420  1.05087 

Coefficients:
                                            Estimate Std. Error t value
(Intercept)                                3.8158658  0.6161996   6.193
rankingPosition                           -0.0005889  0.0000845  -6.969
OpenedHours                               -0.0069183  0.0035038  -1.974
averaged_score_competition                 0.1917746  0.1152355   1.664
averaged_price                            -0.0014835  0.0036205  -0.410
French                                     0.0204046  0.0423726   0.482
Italian                                    0.0643959  0.0459341   1.402
European                                  -0.1414654  0.0403010  -3.510
Vegetarian                                -0.1852601  0.0369857  -5.009
Vegan                                     -0.0181144  0.0430859  -0.420
Mediterranean                             -0.0739046  0.0462391  -1.598
Asian                                     -0.0753480  0.0485599  -1.552
Gluten_free                               -0.0375155  0.0508587  -0.738
Spanish                                    0.1948705  0.0985500   1.977
Swiss                                      0.0288602  0.0523447   0.551
Lunch                                     -0.1260464  0.0605845  -2.081
Dinner                                    -0.0343219  0.0633959  -0.541
Drinks                                     0.0594782  0.0363386   1.637
Brunch                                     0.0670524  0.0582413   1.151
Breakfast                                 -0.4471891  0.4293805  -1.041
log(Distance_to_trainstation)             -0.0074833  0.0310644  -0.241
Late_Night_Drinks                         -0.0698023  0.0506358  -1.379
log(Distance_nearestparking)               0.0148760  0.0232197   0.641
log(Distance_neareststop)                 -0.0010721  0.0231768  -0.046
log(Distance_to_jet)                      -0.0289317  0.0389604  -0.743
log(Distance_to_catedral)                  0.0581749  0.0325930   1.785
log(Distance_to_nationpalace)             -0.0043696  0.0546237  -0.080
averaged_score_competition:averaged_price  0.0003543  0.0008519   0.416
Breakfast:log(Distance_to_trainstation)    0.0821178  0.0628845   1.306
                                          Pr(>|t|)    
(Intercept)                               1.17e-09 ***
rankingPosition                           9.26e-12 ***
OpenedHours                               0.048834 *  
averaged_score_competition                0.096650 .  
averaged_price                            0.682151    
French                                    0.630318    
Italian                                   0.161511    
European                                  0.000485 ***
Vegetarian                                7.41e-07 ***
Vegan                                     0.674342    
Mediterranean                             0.110555    
Asian                                     0.121328    
Gluten_free                               0.461050    
Spanish                                   0.048504 *  
Swiss                                     0.581621    
Lunch                                     0.037947 *  
Dinner                                    0.588461    
Drinks                                    0.102256    
Brunch                                    0.250121    
Breakfast                                 0.298119    
log(Distance_to_trainstation)             0.809727    
Late_Night_Drinks                         0.168611    
log(Distance_nearestparking)              0.522010    
log(Distance_neareststop)                 0.963121    
log(Distance_to_jet)                      0.458050    
log(Distance_to_catedral)                 0.074838 .  
log(Distance_to_nationpalace)             0.936272    
averaged_score_competition:averaged_price 0.677685    
Breakfast:log(Distance_to_trainstation)   0.192156    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3632 on 543 degrees of freedom
Multiple R-squared:  0.1924,    Adjusted R-squared:  0.1508 
F-statistic: 4.621 on 28 and 543 DF,  p-value: 3.769e-13

We wanted to test a possible interaction between the averaged score of the competition and the averaged price. In addition, we also interacted Italian and European because we suspected that their effects might overlap: European could encompass several specific cuisines, such as French or Spanish. On the other hand, an interaction between Brunch and Breakfast, or between Brunch and OpenedHours, could also be interesting.
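The Brunch interactions suggested above were not fitted in the report; a hypothetical sketch of what they could look like is given below. The model name `model7` and the exact specification are illustrative only, assuming `Bigdata` as before:

```r
# Illustrative only: a rating model adding the Brunch x Breakfast and
# Brunch x OpenedHours interactions suggested in the text.
model7 <- lm(rating ~ rankingPosition + averaged_score_competition +
               Brunch * Breakfast + Brunch * OpenedHours +
               Lunch + Dinner + Drinks + Late_Night_Drinks,
             data = Bigdata)
summary(model7)
```

In R's formula syntax, `Brunch * Breakfast` expands to both main effects plus the `Brunch:Breakfast` interaction term, so the main effects do not need to be listed separately.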

Code
model5 <- Bigdata %>% 
  lm(rating~ French + Italian + European +
       Vegetarian + Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss,.) 
Code
summary(model5)

Call:
lm(formula = rating ~ French + Italian + European + Vegetarian + 
    Vegan + Mediterranean + Asian + Gluten_free + Spanish + Swiss, 
    data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.23239 -0.25008 -0.09473  0.27848  0.89136 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.390453   0.033986 129.185  < 2e-16 ***
French         0.033781   0.043999   0.768  0.44295    
Italian        0.010058   0.047343   0.212  0.83184    
European      -0.113691   0.041046  -2.770  0.00579 ** 
Vegetarian    -0.168123   0.037736  -4.455 1.01e-05 ***
Vegan          0.027755   0.044451   0.624  0.53262    
Mediterranean -0.070264   0.048042  -1.463  0.14415    
Asian         -0.083995   0.048753  -1.723  0.08546 .  
Gluten_free    0.039285   0.051141   0.768  0.44271    
Spanish        0.190014   0.101740   1.868  0.06233 .  
Swiss         -0.005519   0.054306  -0.102  0.91909    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3826 on 561 degrees of freedom
Multiple R-squared:  0.07432,   Adjusted R-squared:  0.05782 
F-statistic: 4.504 on 10 and 561 DF,  p-value: 3.835e-06
Code
model6 <- Bigdata %>% 
  lm(rating~ Lunch + Drinks + Brunch + Breakfast + Dinner + Late_Night_Drinks,.) 
Code
summary(model6)

Call:
lm(formula = rating ~ Lunch + Drinks + Brunch + Breakfast + Dinner + 
    Late_Night_Drinks, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1893 -0.2355 -0.1893  0.3104  0.8107 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        4.41123    0.07404  59.577  < 2e-16 ***
Lunch             -0.15865    0.06086  -2.607  0.00938 ** 
Drinks             0.07394    0.03813   1.939  0.05300 .  
Brunch             0.04590    0.06115   0.751  0.45321    
Breakfast          0.04075    0.04815   0.846  0.39780    
Dinner            -0.06297    0.06490  -0.970  0.33234    
Late_Night_Drinks -0.07430    0.05298  -1.402  0.16136    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3903 on 565 degrees of freedom
Multiple R-squared:  0.02985,   Adjusted R-squared:  0.01955 
F-statistic: 2.897 on 6 and 565 DF,  p-value: 0.008639

5. Predictor

Here is the link to the predictor we built as a Shiny app:

Click

6. Recommendations

Future work

  • Time Series Analysis: Obtain temporal data to analyze trends over time. This could reveal seasonal variations or long-term changes in restaurant popularity and customer preferences.

  • Customer Sentiment Analysis: Collect textual reviews to conduct sentiment analysis, which could provide insights into what customers particularly like or dislike about restaurants.

  • Competitive Analysis: Compare restaurants in Geneva with those in other cities or regions to identify unique trends or competitive advantages.

  • Economic Impact Analysis: Explore how changes in the restaurant industry (like new openings, closures, changes in ratings) correlate with economic indicators in Geneva.

  • Sustainability and Dietary Trends: Examine trends related to sustainability practices and the popularity of various dietary preferences, such as the increasing number of restaurants offering vegan, gluten-free, and similar options.
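As a purely illustrative sketch of the sentiment-analysis idea above, the snippet below shows one common R approach using tidytext. It assumes a hypothetical data frame `reviews` with columns `restaurant` and `text`; we do not have these review texts in the current dataset:

```r
# Illustrative only: word-level sentiment scoring of review texts.
library(dplyr)
library(tidytext)

review_sentiment <- reviews %>%
  unnest_tokens(word, text) %>%                        # one row per word
  inner_join(get_sentiments("bing"), by = "word") %>%  # label words pos/neg
  count(restaurant, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment,
                     values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)          # per-restaurant score
```

A per-restaurant `net_sentiment` score could then be added as a candidate predictor in the regressions above.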